Abstract: "THIS PAPER IS ELIGIBLE FOR THE STUDENT PAPER AWARD".

This paper presents a numerical method that finds a lower
bound for the mutual information between a binary and an
arbitrary finite random variable, taken over all joint
distributions whose variational distance to a known joint
distribution does not exceed a known value. This lower bound
can be applied to mutual information estimation with
confidence intervals.

I. Introduction 

A tight lower bound for the mutual information between 
a binary and an arbitrary finite random variable with joint 
distributions that have a variational distance not greater than 
a known value to a known joint distribution can be found by 
minimizing over this set of joint distributions. Unfortunately, 
in general this minimization problem is hard to solve, since 
the mutual information is not convex in the joint distribution. 

Therefore this minimization problem is split up into two 
subproblems. 

If the marginal probability of the binary random variable
is fixed, the mutual information can be minimized over the
conditional probabilities of the second random variable. The
mutual information is convex in the conditional probabilities
[1, Theorem 2.7.4] and the set of admissible conditional
probabilities is convex (see Theorem 1), so this optimization
problem is convex. It constitutes the first subproblem, which
can be solved by standard methods for convex optimization.

The second subproblem concerns the marginal probability
distribution of the binary random variable. This distribution
is only one-dimensional, since its two probabilities have to
sum to 1. Moreover, the variational distance between the joint
distributions is greater than or equal to the variational
distance between the marginal distributions, as is shown in (5).
Therefore one can simply generate sufficiently many marginal
probability distributions equidistantly along the one remaining
dimension, solve the first subproblem for each of these marginal
probability distributions and return the smallest mutual
information calculated that way.



In the next section the notation is fixed. In Section III the
details of the method are given. Section IV discusses
applications of the bound. In Section V some numerical
examples are shown.

II. Notational Setup

Let X, Y be a pair of finite discrete random variables, with 
joint probability distribution 

$$p_{XY} = \{p_{XY}(i,j) : i = 1, 2, \ldots, M_x;\ j = 1, 2, \ldots, M_y\}.$$

Here $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$, and it is
assumed w.l.o.g. that $\mathcal{X} = \{1, 2, \ldots, M_x\}$ and
that $\mathcal{Y} = \{1, 2, \ldots, M_y\}$. The marginal
probability distributions are
$p_X = \{p_X(i) : i = 1, 2, \ldots, M_x\}$ and
$p_Y = \{p_Y(j) : j = 1, 2, \ldots, M_y\}$. They are calculated
from the joint probability distribution as usual.
The conditional probability distributions are

$$p_{Y|X} = \{p_{Y|X}(j|i) : i = 1, 2, \ldots, M_x;\ j = 1, 2, \ldots, M_y\},$$
$$p_{X|Y} = \{p_{X|Y}(i|j) : i = 1, 2, \ldots, M_x;\ j = 1, 2, \ldots, M_y\}.$$

It is defined that
$p_{Y|X} p_X = p_X p_{Y|X} = p_{X|Y} p_Y = p_Y p_{X|Y} = p_{XY}$.
The product of the marginal distributions is denoted as

$$p_X p_Y = \{p_X(i) p_Y(j) : i = 1, 2, \ldots, M_x;\ j = 1, 2, \ldots, M_y\}.$$

For any two joint probability distributions $p_{XY}$, $q_{XY}$ the
relative entropy or Kullback-Leibler distance [1] is defined as

$$D(p_{XY} \| q_{XY}) = \sum_{i=1}^{M_x} \sum_{j=1}^{M_y} p_{XY}(i,j) \log \frac{p_{XY}(i,j)}{q_{XY}(i,j)} \tag{1}$$

and the mutual information between $X$ and $Y$ [1] as the
relative entropy between the joint probability distribution and
the product of the marginal probability distributions of $X$
and $Y$,

$$I(X;Y) = I(p_{XY}) = D(p_{XY} \| p_X p_Y). \tag{2}$$



All logs are assumed to be natural if not stated otherwise. 
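Definition (2) can be evaluated directly from a joint probability table. The following is a minimal sketch in plain Python; the function name `mutual_information` and the `base` parameter are illustrative assumptions, not part of the original text. For the first example of Section V it gives roughly 0.153 with natural logarithms and roughly 0.221 with base-2 logarithms, which suggests that the values reported there (0.2210) are stated in bits.

```python
import math

def mutual_information(pxy, base=math.e):
    """Mutual information (2) of a joint distribution pxy[i][j].

    Cells with zero probability contribute nothing (0 log 0 = 0).
    """
    px = [sum(row) for row in pxy]        # marginal of X
    py = [sum(col) for col in zip(*pxy)]  # marginal of Y
    total = 0.0
    for i, row in enumerate(pxy):
        for j, p in enumerate(row):
            if p > 0.0:
                total += p * math.log(p / (px[i] * py[j]))
    return total / math.log(base)

# Joint distribution of the first numerical example (Section V).
p_first_example = [[0.017, 0.285], [0.424, 0.274]]
i_nats = mutual_information(p_first_example)
i_bits = mutual_information(p_first_example, base=2)
```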

The variational distance between two joint probability
distributions is defined as

$$V(p_{XY}, q_{XY}) = \|p_{XY} - q_{XY}\|_1 = \sum_{i=1}^{M_x} \sum_{j=1}^{M_y} |p_{XY}(i,j) - q_{XY}(i,j)|,$$

and similarly for the marginal distributions. It can easily be
seen that $V(\cdot, \cdot) \in [0, 2]$ for any two probability
distributions.
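As a small illustration of the definition and of the range [0, 2], the sketch below computes the variational distance for two extreme cases: identical distributions and distributions with disjoint support. The function name `variational_distance` and the two test distributions are assumptions made for this example.

```python
def variational_distance(p, q):
    """Variational (L1) distance between two distributions given as
    nested lists of equal shape."""
    return sum(
        abs(a - b)
        for row_p, row_q in zip(p, q)
        for a, b in zip(row_p, row_q)
    )

uniform = [[0.25, 0.25], [0.25, 0.25]]
# Identical distributions: the distance attains its minimum, 0.
identical = variational_distance(uniform, uniform)
# Disjoint supports: the distance attains its maximum, 2.
disjoint = variational_distance([[1.0, 0.0], [0.0, 0.0]],
                                [[0.0, 0.0], [0.0, 1.0]])
```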

III. Results 

First it is shown that the set of all conditional probability
distributions constrained by a maximal variational distance is
convex.

Theorem 1: Let $p_{XY} = p_X p_{Y|X}$ be any fixed joint
probability distribution of any two discrete finite random
variables $X, Y$, let $q_X$ be any fixed probability distribution
of $X$ and let $\epsilon$ be any fixed number in $[0, 2]$. Then
the set
$Q = \{q_{Y|X} \mid V(q_X q_{Y|X}, p_{XY}) \le \epsilon\}$ is convex.

Proof: Let $q^{(1)}_{Y|X}, q^{(2)}_{Y|X}$ be any two conditional
probability distributions in $Q$. Then one only has to show that
the convex combination
$q^{(\lambda)}_{Y|X} = \lambda q^{(1)}_{Y|X} + (1 - \lambda) q^{(2)}_{Y|X}$
with $\lambda \in [0, 1]$ is also in $Q$. Before this is done, it
is defined that $q^{(1)}_{XY} = q_X q^{(1)}_{Y|X}$,
$q^{(2)}_{XY} = q_X q^{(2)}_{Y|X}$ and
$q^{(\lambda)}_{XY} = \lambda q^{(1)}_{XY} + (1 - \lambda) q^{(2)}_{XY} = q_X q^{(\lambda)}_{Y|X}$.
Now, to prove that $q^{(\lambda)}_{Y|X} \in Q$, one only has to
show that $V(q_X q^{(\lambda)}_{Y|X}, p_{XY}) \le \epsilon$. To
this end,

$$V(q_X q^{(\lambda)}_{Y|X}, p_{XY}) = V(q^{(\lambda)}_{XY}, p_{XY}) = \|q^{(\lambda)}_{XY} - p_{XY}\|_1 \le \epsilon, \tag{3}$$

where the fact that any norm ball is convex [2, Section 2.2.3]
has been used in (3). Also, the further constraints implied by
the probability simplex (which is convex) pose no problem, since
an intersection of convex sets is always convex [2, Section
2.3.1]. ■
Since the empty set is convex, no restriction on $V(p_X, q_X)$
(e.g. $V(p_X, q_X) \le \epsilon$) is necessary.
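The norm-ball argument used in the proof of Theorem 1 can be spelled out with the triangle inequality. In the sketch below, $q^{(1)}_{XY}$ and $q^{(2)}_{XY}$ denote the joint distributions obtained by multiplying $q_X$ with two members of $Q$, and $q^{(\lambda)}_{XY}$ denotes their convex combination with weight $\lambda \in [0, 1]$; the superscript notation is an assumption made here for readability.

```latex
\begin{align*}
\|q^{(\lambda)}_{XY} - p_{XY}\|_1
  &= \|\lambda (q^{(1)}_{XY} - p_{XY}) + (1 - \lambda)(q^{(2)}_{XY} - p_{XY})\|_1 \\
  &\le \lambda \|q^{(1)}_{XY} - p_{XY}\|_1 + (1 - \lambda) \|q^{(2)}_{XY} - p_{XY}\|_1 \\
  &\le \lambda \epsilon + (1 - \lambda) \epsilon = \epsilon .
\end{align*}
```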

Corollary 1: Let $p_{XY}$ be any fixed joint probability
distribution of any two discrete finite random variables $X, Y$,
let $q_X$ be any fixed probability distribution of $X$ and let
$\epsilon$ be any fixed number in $[0, 2]$. Then the
optimization problem

$$\min_{q_{Y|X} \,:\, V(q_X q_{Y|X}, p_{XY}) \le \epsilon} I(q_X q_{Y|X}) \tag{4}$$

is convex.

Proof: The mutual information $I(q_X q_{Y|X})$ is a convex
function of the conditional probabilities $q_{Y|X}$ when $q_X$ is
fixed, and the set
$\{q_{Y|X} \mid V(q_X q_{Y|X}, p_{XY}) \le \epsilon\}$ is convex. ■
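The convexity of the mutual information in the conditional probabilities for a fixed marginal can be checked numerically on a single pair of points. The sketch below is only a spot check of midpoint convexity, not a proof; the marginal and the two conditional distributions are hypothetical values chosen so that their midpoint has identical rows, which makes the two variables independent there.

```python
import math

def mutual_information(qx, qy_given_x):
    """I(q_X q_{Y|X}) in nats, with qy_given_x[i][j] = q_{Y|X}(j|i)."""
    n_y = len(qy_given_x[0])
    # Marginal of Y induced by q_X and q_{Y|X}.
    qy = [sum(qx[i] * qy_given_x[i][j] for i in range(len(qx)))
          for j in range(n_y)]
    total = 0.0
    for i, row in enumerate(qy_given_x):
        for j, c in enumerate(row):
            joint = qx[i] * c
            if joint > 0.0:
                # joint / (qx[i] * qy[j]) simplifies to c / qy[j].
                total += joint * math.log(c / qy[j])
    return total

qx = [0.302, 0.698]                   # fixed marginal (hypothetical)
cond_a = [[0.1, 0.9], [0.6, 0.4]]     # first conditional distribution
cond_b = [[0.7, 0.3], [0.2, 0.8]]     # second conditional distribution
cond_mid = [[0.5 * a + 0.5 * b for a, b in zip(ra, rb)]
            for ra, rb in zip(cond_a, cond_b)]

i_a = mutual_information(qx, cond_a)
i_b = mutual_information(qx, cond_b)
i_mid = mutual_information(qx, cond_mid)
```

Here both endpoints carry positive mutual information, while the midpoint carries (numerically) none, so the value at the midpoint lies below the average of the endpoint values, as convexity requires.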

Corollary 1 implies that the given optimization problem can be
solved in practice. However, since it is a general convex
optimization problem, it can still be cumbersome to find a
suitable algorithm with the correct parameters. Fortunately the
problem can be restated in such a way that it can be handled by
disciplined convex programming (DCP) [3], which works well for
this problem, as can be seen in Section V.

The minimization problem in Corollary 1 cannot be solved
in a straightforward manner with DCP, since this would
violate the no-product rule of DCP (see (1), (2)). There is
also no built-in function in CVX (the software that implements
DCP) for the mutual information as a function of the
conditional probabilities when the corresponding marginal
probability is fixed. Therefore the relative entropy, which is
a built-in function in CVX and is convex in its two input
arguments, is used. It can be seen that

$$I(X;Y) = I(q_X q_{Y|X}) = D(q_X q_{Y|X} \| q_X q_Y),$$

and that $q_X(i) q_{Y|X}(j|i)$ is an affine function of $q_{Y|X}$,
as is
$q_X(i) q_Y(j) = q_X(i) \left( \sum_{i'} q_{Y|X}(j|i') q_X(i') \right)$.
Hence the convexity of $D(\cdot, \cdot)$ is preserved
[2, Section 2.3.2], and with this knowledge it is
straightforward to implement the minimization problem in
Corollary 1 with CVX.
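Written out, the restated problem has both arguments of the relative entropy affine in the optimization variable $q_{Y|X}$. The following is one way to state it explicitly, under the assumption that the simplex constraints on each row of $q_{Y|X}$ are imposed directly; it is a sketch of the formulation and not the authors' CVX code.

```latex
\begin{aligned}
\min_{q_{Y|X}} \quad
  & \sum_{i=1}^{M_x} \sum_{j=1}^{M_y}
    q_X(i)\, q_{Y|X}(j|i)\,
    \log \frac{q_X(i)\, q_{Y|X}(j|i)}
              {q_X(i) \sum_{i'=1}^{M_x} q_X(i')\, q_{Y|X}(j|i')} \\
\text{subject to} \quad
  & \sum_{i=1}^{M_x} \sum_{j=1}^{M_y}
    \left| q_X(i)\, q_{Y|X}(j|i) - p_{XY}(i,j) \right| \le \epsilon, \\
  & q_{Y|X}(j|i) \ge 0 \quad \text{for all } i, j, \\
  & \sum_{j=1}^{M_y} q_{Y|X}(j|i) = 1 \quad \text{for all } i.
\end{aligned}
```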

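Next the second subproblem, namely the minimization of
the mutual information over the marginal probability
distribution $q_X$, is solved. To this end it is first shown that

$$\begin{aligned}
V(q_X, p_X) &= \|q_X - p_X\|_1 \\
&= \sum_{i=1}^{M_x} |q_X(i) - p_X(i)| \\
&= \sum_{i=1}^{M_x} \left| \sum_{j=1}^{M_y} \left( q_{XY}(i,j) - p_{XY}(i,j) \right) \right| \\
&\le \sum_{i=1}^{M_x} \sum_{j=1}^{M_y} |q_{XY}(i,j) - p_{XY}(i,j)| \\
&= V(q_{XY}, p_{XY}) \\
&\le \epsilon.
\end{aligned} \tag{5}$$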
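Inequality (5) can be illustrated on a concrete pair of distributions. In the sketch below, `p` is the joint distribution of the first example of Section V and `q` is a hypothetical perturbation of it chosen for this illustration; the helper names are assumptions as well.

```python
def l1_distance(u, v):
    """Sum of absolute differences between two equally long sequences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def flatten(joint):
    """Joint distribution table as a flat list of cell probabilities."""
    return [value for row in joint for value in row]

def marginal_x(joint):
    """Marginal distribution of X (row sums of the joint table)."""
    return [sum(row) for row in joint]

p = [[0.017, 0.285], [0.424, 0.274]]
q = [[0.117, 0.235], [0.374, 0.274]]  # hypothetical perturbation of p

v_joint = l1_distance(flatten(q), flatten(p))           # about 0.2
v_marginal = l1_distance(marginal_x(q), marginal_x(p))  # about 0.1
```

Part of the mass moved within the first row cancels when the row is summed, which is why the marginal distance is strictly smaller than the joint distance in this instance.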



Therefore only $q_X$ with $V(q_X, p_X) \le \epsilon$ have to be
considered. Up to this point all results are applicable to any
finite $M_x$, but from here on the restriction $M_x = 2$ applies.
In this case $q_X$ is obviously one-dimensional, and the set of
all $q_X$ is simply
$\{q_X = \{\min(p_X(1) + \gamma, 1), \max(p_X(2) - \gamma, 0)\} \mid \gamma \in [-\frac{\epsilon}{2}, \frac{\epsilon}{2}]\}$.
In practice, the minimization problem

$$\min_{q_{XY} \,:\, V(q_{XY}, p_{XY}) \le \epsilon} I(q_{XY}) \tag{6}$$

is then simply solved by generating sufficiently many $q_X$
equidistantly in $\gamma$, solving the optimization problem of
Corollary 1 for every $q_X$ and returning the smallest mutual
information calculated that way. Here the number of
distributions $q_X$ is considered to be sufficient if one gets
a smooth graph for the mutual information minimized over the
conditional probabilities $q_{Y|X}$ as a function of $\gamma$.
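The two-level structure of this procedure can be sketched in plain Python for the special case of two binary variables. Note the substitution: instead of the convex solver used for the inner problem in the text, this sketch uses a brute-force grid over the two free conditional probabilities, purely to keep the example dependency-free. All function names and grid sizes are assumptions, and the first marginal probability is clipped to [0, 1] on both sides, which is also an assumption of this sketch.

```python
import math

def variational_distance(p, q):
    """L1 distance between two joint distributions (nested lists)."""
    return sum(abs(a - b) for rp, rq in zip(p, q) for a, b in zip(rp, rq))

def mutual_information_bits(q):
    """Mutual information of a joint distribution q[i][j], in bits."""
    qx = [sum(row) for row in q]
    qy = [sum(col) for col in zip(*q)]
    return sum(
        v * math.log2(v / (qx[i] * qy[j]))
        for i, row in enumerate(q)
        for j, v in enumerate(row)
        if v > 0.0
    )

def grid_minimum(p, eps, n_gamma=21, n_cond=101):
    """Approximate (6) for binary X and binary Y by nested grid search."""
    px1 = sum(p[0])
    best = float("inf")
    for g in range(n_gamma):
        # Outer loop: marginals q_X generated equidistantly in gamma.
        gamma = -eps / 2 + eps * g / (n_gamma - 1)
        q1 = min(max(px1 + gamma, 0.0), 1.0)  # clipped on both sides
        # Inner loops: brute-force stand-in for the convex subproblem.
        for a_idx in range(n_cond):
            a = a_idx / (n_cond - 1)          # q_{Y|X}(1|1)
            for b_idx in range(n_cond):
                b = b_idx / (n_cond - 1)      # q_{Y|X}(1|2)
                q = [[q1 * a, q1 * (1 - a)],
                     [(1 - q1) * b, (1 - q1) * (1 - b)]]
                if variational_distance(q, p) <= eps:
                    best = min(best, mutual_information_bits(q))
    return best

# First numerical example of Section V.
p_example = [[0.017, 0.285], [0.424, 0.274]]
approx_min = grid_minimum(p_example, eps=0.3)
```

Because the grid only visits a subset of the feasible set, the returned value can overestimate the true minimum and is not itself a rigorous bound; a convex solver for the inner problem, as described above, avoids this limitation.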

IV. Discussion 

Together with the bound on the probability of a maximal
variational distance between the true joint distribution and an
empirical joint distribution (see [6], and especially a
refinement of it which drops the dependence on the true
distribution [4, Lemma 3]), the given bound can be used to
construct a reasonably tight lower bound of the confidence
interval for mutual information. Such an application can be
found in [8]. In mutual information estimation with confidence
intervals, the given bound is especially useful when the
marginal probability distribution is far from uniform. Such a
situation can be found in [7]. In the case of two binary random
variables the results seem to coincide with the lower bound
of [5].



V. Numerical examples 

In the first example (Fig. 1) a distribution $p_{XY}$ and a
maximal variational distance $\epsilon$ were handpicked to show
that the mutual information minimized over the conditional
probabilities $q_{Y|X}$, as a function of $\gamma$, is neither
convex nor concave (even for two binary random variables) and
seems not to be differentiable at $\gamma = 0$, as can be seen
in Fig. 1. The parameters chosen for this purpose are

$$\begin{aligned}
p_{XY}(1,1) &= 0.017, & p_{XY}(1,2) &= 0.285, \\
p_{XY}(2,1) &= 0.424, & p_{XY}(2,2) &= 0.274,
\end{aligned}$$

and $\epsilon = 0.3$.

Then,

$$I(p_{XY}) \approx 0.2210 \quad \text{and} \quad \min_{q_{XY} \,:\, V(q_{XY}, p_{XY}) \le \epsilon} I(q_{XY}) \approx 0.0019.$$

In all figures $I$ is equal to the minimum of $I(q_X q_{Y|X})$
over $q_{Y|X}$ for fixed
$q_X = \{\min(p_X(1) + \gamma, 1), \max(p_X(2) - \gamma, 0)\}$,
constrained by $V(q_X q_{Y|X}, p_{XY}) \le \epsilon$, and 1000
points were generated equidistantly for
$\gamma \in [-\frac{\epsilon}{2}, \frac{\epsilon}{2}]$.
Fig. 1.
Fig. 2.



In the second example (Fig. 2) $M_y = 5$ and the following
joint distribution was chosen at random (rounded for easier
reproducibility)

$$\begin{aligned}
p_{XY}(1,1) &= 0.090, & p_{XY}(1,2) &= 0.098, & p_{XY}(1,3) &= 0.207, \\
p_{XY}(1,4) &= 0.064, & p_{XY}(1,5) &= 0.026, \\
p_{XY}(2,1) &= 0.239, & p_{XY}(2,2) &= 0.030, & p_{XY}(2,3) &= 0.104, \\
p_{XY}(2,4) &= 0.107, & p_{XY}(2,5) &= 0.035,
\end{aligned}$$

and $\epsilon = 0.1$.

Then,

$$I(p_{XY}) \approx 0.1112 \quad \text{and} \quad \min_{q_{XY} \,:\, V(q_{XY}, p_{XY}) \le \epsilon} I(q_{XY}) \approx 0.0524.$$

In the last example (Fig. 3) $M_y = 10$ and the following
joint distribution was chosen at random (rounded for easier
reproducibility)

$$\begin{aligned}
p_{XY}(1,1) &= 0.101, & p_{XY}(1,2) &= 0.062, & p_{XY}(1,3) &= 0.025, \\
p_{XY}(1,4) &= 0.088, & p_{XY}(1,5) &= 0.005, & p_{XY}(1,6) &= 0.007, \\
p_{XY}(1,7) &= 0.069, & p_{XY}(1,8) &= 0.059, & p_{XY}(1,9) &= 0.080, \\
p_{XY}(1,10) &= 0.074, \\
p_{XY}(2,1) &= 0.103, & p_{XY}(2,2) &= 0.006, & p_{XY}(2,3) &= 0.038, \\
p_{XY}(2,4) &= 0.002, & p_{XY}(2,5) &= 0.018, & p_{XY}(2,6) &= 0.079, \\
p_{XY}(2,7) &= 0.049, & p_{XY}(2,8) &= 0.032, & p_{XY}(2,9) &= 0.020, \\
p_{XY}(2,10) &= 0.020,
\end{aligned}$$

and $\epsilon = 0.1$.

Then,

$$I(p_{XY}) \approx 0.1311 \quad \text{and} \quad \min_{q_{XY} \,:\, V(q_{XY}, p_{XY}) \le \epsilon} I(q_{XY}) \approx 0.0369.$$